intermediate activation
Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance, and has become the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for gradient calculation, akin to full fine-tuning. One effective way to reduce activation memory is to use a reversible model, so that intermediate activations need not be cached and can be recomputed instead. Nevertheless, modifying a PLM into its reversible variant is not straightforward, since a reversible model has a distinct architecture from currently released PLMs. In this paper, we first investigate what makes existing PEFT methods successful, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method.
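As a minimal sketch of the reversible idea (illustrative PyTorch; the `ReversibleBlock`, `f`, and `g` names are hypothetical, not the paper's architecture): in an additive coupling block the inputs can be reconstructed exactly from the outputs, so a backward pass can recompute activations rather than read them from a cache.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).

    The map is exactly invertible, so (x1, x2) can be recomputed
    from (y1, y2) during backprop instead of being cached.
    """

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs; no activation cache needed.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Round-trip check: inverse(forward(x)) == x up to float error.
f = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
g = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
block = ReversibleBlock(f, g)
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
r1, r2 = block.inverse(*block.forward(x1, x2))
assert torch.allclose(x1, r1, atol=1e-5) and torch.allclose(x2, r2, atol=1e-5)
```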
- Research Report > Experimental Study (0.68)
- Research Report > Promising Solution (0.46)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Missouri > St. Louis County > St. Louis (0.04)
- North America > Canada (0.04)
Tightness of bounds in Theorem 3.1 (all reviewers)
For the class of networks mentioned, the last inequality becomes trivial given the MSE loss. For ReLU networks, weights can be found such that the other inequalities are tight. We discuss the use of post-activations in Section C.2. As the figure shows, the theorem thus applies to a successful method. To the reviewer's other point, we also note that ReLU is Lipschitz continuous with constant 1. Thus, we believe the algorithm can be widely applied.
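For reference, the 1-Lipschitz claim for ReLU has a one-line justification; a sketch of the standard argument (not the authors' exact wording):

```latex
% ReLU(t) = max(t, 0) is Lipschitz with constant 1: max(., 0) is the
% projection of R onto the convex set [0, inf), and projections onto
% convex sets are non-expansive. Hence, for all x, y in R,
% (requires amsmath for \operatorname and \lvert/\rvert)
\[
  \lvert \operatorname{ReLU}(x) - \operatorname{ReLU}(y) \rvert
  = \lvert \max(x, 0) - \max(y, 0) \rvert
  \le \lvert x - y \rvert .
\]
```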
PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning
Yang, Xingke, Li, Liang, Wan, Zhiyi, Li, Sicong, Qi, Xiaoqi, Liu, Jiang, Ohtsuki, Tomoaki, Fu, Xin, Pan, Miao
There is a huge gap between the numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) on fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from the heavy communication burden of activation transmission and may disclose data and labels to the server. To address those issues, we develop PAE MobiLLM, a privacy-aware and efficient LLM FT method that can be deployed on the mobile device via server-assisted additive side-tuning. To accelerate FT convergence and improve computing efficiency, PAE MobiLLM integrates activation caching on the server side, which allows the server to reuse historical activations and saves the mobile device from repeatedly computing forward passes for recurring data samples. In addition, to reduce communication cost, PAE MobiLLM develops an activation shortcut that transmits only the token involved in the loss calculation, instead of full activation matrices, to guide the side-network tuning. Last but not least, PAE MobiLLM introduces an additive adapter side-network design in which the server trains the adapter modules based on device-defined prediction differences rather than raw ground-truth labels. In this way, the server can only assist device-defined side-network computing, and learns nothing about the data or labels. Extensive experimental results demonstrate PAE MobiLLM's superiority.
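A minimal sketch of the prediction-difference signal described above, with hypothetical module names and MSE as an illustrative surrogate objective (the paper's exact loss and architecture may differ): the device keeps the frozen backbone and the raw labels, and ships only an intermediate activation plus a prediction difference to the server, which trains the additive adapter.

```python
import torch
import torch.nn as nn

D_IN, D_HID, N_CLS = 32, 64, 10            # illustrative sizes only

# Device side: frozen backbone, split into features and a head.
features = nn.Sequential(nn.Linear(D_IN, D_HID), nn.ReLU())
head = nn.Linear(D_HID, N_CLS)
for m in (features, head):
    for p in m.parameters():
        p.requires_grad_(False)

# Server side: trainable additive side-network (adapter).
adapter = nn.Sequential(nn.Linear(D_HID, 16), nn.ReLU(), nn.Linear(16, N_CLS))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x = torch.randn(8, D_IN)                   # fresh on-device batch
label = torch.randint(0, N_CLS, (8,))      # never leaves the device

with torch.no_grad():
    h = features(x)                        # activation sent to the server
    base_logits = head(h)                  # device-side backbone prediction

# Device-defined prediction difference: the residual the side network
# should add so that base_logits + adapter(h) matches the target. Only h
# and diff are transmitted; the raw label stays on the device.
diff = (nn.functional.one_hot(label, N_CLS).float() - base_logits).detach()

# Server update on the difference (MSE as an illustrative surrogate).
loss = nn.functional.mse_loss(adapter(h), diff)
opt.zero_grad(); loss.backward(); opt.step()
```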
- Asia > Japan (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (3 more...)
- Information Technology > Security & Privacy (0.93)
- Health & Medicine (0.93)
An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes
Jenhani, Zied, Bensalem, Mounir, Dizdarević, Jasenka, Jukan, Admela
Running deep learning inference directly on ultra-low-power edge/IoT nodes has been limited by the tight memory and compute budgets of microcontrollers. Split learning (SL) addresses this limitation by executing part of the inference process on the sensor and offloading the remainder to a companion device. In the context of constrained devices and low-power, over-the-air transport protocols, the performance of split learning remains largely unexplored. To the best of our knowledge, this paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards, designed to benchmark the over-the-air performance of split-learning TinyML in edge/IoT environments. We benchmark a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates. The intermediate activations are exchanged through different wireless communication methods: ESP-NOW, BLE, and traditional UDP/IP and TCP/IP, enabling a head-to-head comparison on identical hardware. Measurements show that splitting the model after the block_16_project_BN layer generates a 5.66 kB tensor that traverses the link in 3.2 ms when UDP is used, achieving a steady-state round-trip latency of 5.8 s. ESP-NOW delivers the most favorable RTT (3.7 s); BLE extends battery life further but pushes latency beyond 10 s.
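A rough desktop-side Python sketch of the split-inference handoff (hypothetical addresses, chunk size, and tensor shape; the actual testbed runs compiled firmware on the ESP32-S3): the sensor node serializes the int8 activation tensor at the split point and ships it over UDP, and the companion reassembles it before running the remaining layers.

```python
import socket
import numpy as np

# Hypothetical values for illustration; the real testbed's split tensor
# after block_16_project_BN is 5.66 kB of int8 activations.
SPLIT_SHAPE = (1, 7, 7, 112)            # 5488 bytes, roughly that size
COMPANION = ("192.168.4.2", 5005)       # companion device address/port
CHUNK = 1400                            # stay under a typical UDP MTU

def sensor_send(activation: np.ndarray) -> None:
    """Sensor side: serialize the int8 split-point tensor over UDP."""
    assert activation.dtype == np.int8
    payload = activation.tobytes()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Fragment into MTU-sized datagrams; real firmware must also handle
    # ordering and loss, which raw UDP does not guarantee.
    for off in range(0, len(payload), CHUNK):
        sock.sendto(payload[off:off + CHUNK], COMPANION)
    sock.close()

def companion_receive(n_bytes: int) -> np.ndarray:
    """Companion side: reassemble the tensor, then run the model tail."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", COMPANION[1]))
    chunks, got = [], 0
    while got < n_bytes:                # assumes an in-order, lossless link
        data, _ = sock.recvfrom(2048)
        chunks.append(data)
        got += len(data)
    sock.close()
    return np.frombuffer(b"".join(chunks), dtype=np.int8).reshape(SPLIT_SHAPE)
```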
- Research Report > New Finding (0.50)
- Research Report > Experimental Study (0.40)
Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
Silveria, Austin, Govande, Soham V., Fu, Daniel Y.
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in high-quality image and video generation but incur substantial compute cost at inference. A common observation is that DiT latent noise vectors change slowly across inference steps, which suggests that the DiT compute may be redundant across steps. In this paper, we aim to speed up inference by reducing this redundancy, without additional training. We first study how activations change between steps in two state-of-the-art open-source DiTs. We find that just 5-25% of the values in attention and MLP explain 70-90% of the change in activations across steps. This finding motivates our approach, Chipmunk, which uses dynamic sparsity at inference time to recompute only the fastest-changing intermediate activations, while caching the rest. Dynamic sparsity introduces two systems challenges: (1) sparse attention and MLP operations tend to underutilize GPU tensor cores; and (2) computing dynamic sparsity patterns at runtime and caching activations both introduce overhead. To address these challenges, Chipmunk first uses a voxel-based reordering of input tokens to introduce column-wise sparsity. We implement column-sparse kernels utilizing efficient sparse gathers from global to shared GPU memory, achieving a 9.3x speedup at 93% sparsity compared to highly optimized dense baselines. Second, Chipmunk overlaps the computation of sparsity patterns and cache updates with other parts of the computation (e.g., the second layer of the MLP) to hide the extra latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev without compromising generation quality. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.25x speedup on FLUX.1-dev with minimal quality impact.
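A toy NumPy sketch of the cache-and-delta idea (hypothetical sizes and names; the real system uses custom column-sparse GPU kernels and a more careful hot-set selection): rank hidden columns by how much they changed since the cached step, recompute only the top fraction, and patch the cached output with a sparse delta.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP, y = relu(x @ W1) @ W2, with hypothetical sizes.
W1 = rng.standard_normal((64, 256))
W2 = rng.standard_normal((256, 64))

def mlp_sparse_delta(x, cache, keep_frac=0.15):
    """Recompute only the fastest-changing hidden columns across steps."""
    # Dense here for clarity; Chipmunk picks the hot set and applies it
    # with column-sparse kernels, overlapping the bookkeeping with other
    # parts of the computation.
    h = np.maximum(x @ W1, 0.0)
    if cache.get("h") is None:          # first step: full compute + cache
        cache["h"], cache["y"] = h, h @ W2
        return cache["y"]

    # Rank hidden columns by change since the cached step; keep the top-k.
    change = np.abs(h - cache["h"]).sum(axis=0)
    k = max(1, int(keep_frac * h.shape[1]))
    hot = np.argsort(change)[-k:]

    # Patch the cached output with a sparse delta:
    # y_new = y_old + (h_hot_new - h_hot_old) @ W2[hot].
    cache["y"] = cache["y"] + (h[:, hot] - cache["h"][:, hot]) @ W2[hot, :]
    cache["h"][:, hot] = h[:, hot]
    return cache["y"]

# Across two nearby "inference steps", the second call updates ~15% of
# hidden columns and reuses the cached contribution of the rest.
cache = {}
x0 = rng.standard_normal((4, 64))
y0 = mlp_sparse_delta(x0, cache)
y1 = mlp_sparse_delta(x0 + 0.01 * rng.standard_normal((4, 64)), cache)
```

The delta update keeps the cached output exactly consistent with the cached hidden state, so accuracy loss comes only from the columns deliberately left stale, mirroring the 5-25% observation in the abstract.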
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Workflow (0.68)
- Research Report (0.52)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)